2018 Edition
A home is often the largest and most expensive purchase a person makes in his or her lifetime. Ensuring homeowners have a trusted way to monitor this asset is incredibly important.
In this competition, students are required to develop a full-fledged approach to predicting the future sale prices of homes. A full-fledged approach consists, at least, of the following steps:
Now, if you asked a home buyer to describe their dream house, they probably wouldn't begin by describing the height of the basement ceiling or the proximity to a railroad. As you will see, the dataset used in this competition shows that many more features influence price negotiations than the number of bedrooms or a white picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in a small city in the US, this competition challenges you to predict the final price of each home.
Here's a brief version of what you'll find in the data description file.
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.
Notebooks are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
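As a quick illustration of the metric (the prices below are made up, not predictions from this notebook), the log-RMSE can be computed directly with NumPy:

```python
import numpy as np

# Illustrative, made-up sale prices and predictions
y_true = np.array([169000.0, 187724.0, 175221.0])
y_pred = np.array([172000.0, 180000.0, 176500.0])

# RMSE between log-prices: errors on cheap and expensive houses weigh equally
rmsle = np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))
print(rmsle)
```

The same value can be obtained with `sklearn.metrics.mean_squared_error` applied to the log-prices, followed by a square root.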
The file should contain a header and have the following format:
Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
You will find an example submission file within the data directory in the repository.
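A submission file in the required format can be produced with pandas; the `Id` values and prices below are placeholders, not actual predictions:

```python
import pandas as pd

# Placeholder predictions for three hypothetical test-set Ids
submission = pd.DataFrame({"Id": [1461, 1462, 1463],
                           "SalePrice": [169000.1, 187724.12, 175221.0]})

# to_csv writes the header row by default; index=False drops the row index
submission.to_csv("submission.csv", index=False)
```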
This challenge is going to be graded as a regular notebook for the AML labs. As a consequence, students should submit:
In summary, you will have to submit 2 files!
# ! pip install lightgbm
#! pip install xgboost
#! pip install missingno
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns # Seaborn visualization library
import missingno as msno # Missingno package for visualizing missing data
from IPython.display import display
from time import time
%matplotlib inline
sns.set(style='ticks')
# Statistical packages used for transformations
from scipy.stats import f_oneway, skew
from scipy.special import boxcox1p
from sklearn.decomposition import PCA
# Metrics used for measuring the accuracy and performance of the models
from sklearn.metrics import mean_squared_error, r2_score
# Algorithms used for modeling
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV, Lasso, Ridge, BayesianRidge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
import xgboost as xgb
# Pipeline and scaling preprocessing will be used for models that are sensitive to feature scaling
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, PolynomialFeatures
from sklearn.feature_selection import RFE, SelectKBest, f_regression, chi2
# Model selection packages used for sampling dataset and optimising parameters
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score, GridSearchCV
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Fix the random seed
RANDOM_STATE = 42
np.random.seed(seed=RANDOM_STATE)
# Define the path of the data
path_data = "challenge_data/"
# Parameters to enable/disable some steps
apply_boxcox = True
apply_featurizing = True
drop_highly_correlated = True
# Construct a dataframe from the training data csv file
train_data = pd.read_csv(path_data + "train.csv")
# Construct a dataframe from the test data csv file
test_data = pd.read_csv(path_data + "test.csv")
# Concatenate the training and test data into one dataframe. It will be useful when dealing with missing data
data = pd.concat([train_data, test_data])
data.head()
data.describe().T
# Drop the ID column
train_data.drop(['Id'], axis=1, inplace=True)
test_data.drop(['Id'], axis=1, inplace=True)
data.drop(['Id'], axis=1, inplace=True)
# Drop the sale price column
data.drop(['SalePrice'], axis=1, inplace=True)
def show_missing(df, n=20):
    total_missing_data = df.isnull().sum().sort_values(ascending=False)
    percentage = total_missing_data * 100 / len(df)
    missing_pd = pd.concat([total_missing_data, percentage], axis=1, keys=['Total', 'Percentage'])
    return missing_pd.head(n)
show_missing(data)
# Get the features having missing values
missingdata = data.columns[data.isnull().any()].tolist()
# Keep the data frame with only those features
only_missing = data[missingdata]
# Plot the nullity matrix
msno.matrix(only_missing);
with sns.axes_style("whitegrid"):
    msno.bar(only_missing, log=True, figsize=(20,10));
df = data[['PoolArea']] # Keep only the PoolArea column
print("The number of times where the PoolArea feature is equal to 0:",
int(df[df.PoolArea == 0].count()))
print("The number of times where the PoolArea feature is not equal to 0 (houses having a swimming pool):",
int(df[df.PoolArea != 0].count()))
# Mark data with NaN values with True and others with False
df_null = data[['PoolQC']].isnull()
# Filter rows where the column is equal to True and then count them
print("The number of times where the PoolQC feature is missing:", int(df_null[df_null.PoolQC == True].count()))
print("The number of times where the PoolQC feature is indicated:", int(df_null[df_null.PoolQC == False].count()))
df = data[['MiscVal']] # Keep only the MiscVal column
print("The number of times where the MiscVal feature is equal to 0:",
int(df[df.MiscVal == 0].count()))
print("The number of times where the MiscVal feature is not equal to 0 (houses having a miscellaneous feature like garage, elevator...):",
int(df[df.MiscVal != 0].count()))
# Mark data with NaN values with True and others with False
df_null = data[['MiscFeature']].isnull()
# Filter rows where the column is equal to True and then count them
print("The number of times where the MiscFeature feature is missing:", int(df_null[df_null.MiscFeature == True].count()))
print("The number of times where the MiscFeature feature is indicated:", int(df_null[df_null.MiscFeature != True].count()))
data[(df.MiscVal == 0) & (df_null.MiscFeature != True)][['MiscVal', 'MiscFeature']]
msno.heatmap(only_missing);
msno.dendrogram(only_missing);
def fillWithNone(df):
    # Features where NaN means the house lacks the amenity, so "None" is the appropriate fill
    columns_to_fill = [
        "PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageType",
        "GarageCond", "GarageFinish", "GarageQual", "BsmtExposure", "BsmtFinType2",
        "BsmtFinType1", "BsmtCond", "BsmtQual"
    ]
    for column in columns_to_fill:
        df[column].fillna("None", inplace=True)
fillWithNone(data)
show_missing(data, 10)
def replaceMissingValues(df):
    df["LotFrontage"] = df.groupby(['Neighborhood'])['LotFrontage'].transform(
        lambda x: x.fillna(x.median())
    )  # Linear feet of street connected to property
    # If a neighborhood has only NaN LotFrontage values, fall back to the overall median
    df["LotFrontage"].fillna(df["LotFrontage"].median(), inplace=True)
    #df["GarageYrBlt"] = df["GarageYrBlt"].fillna(0)  # Year garage was built = 0 (No Garage)
    df["GarageYrBlt"].fillna(df["YrSold"], inplace=True)  # Year garage was built = Year Sold (No Garage)
    df["GarageArea"] = df["GarageArea"].fillna(0)  # Size of garage in square feet = 0 (No Garage)
    df["GarageCars"] = df["GarageCars"].fillna(0)  # Size of garage in car capacity = 0 (No Garage)
    df["MasVnrArea"] = df["MasVnrArea"].fillna(0)  # Masonry veneer area in square feet = 0 <=> Masonry veneer type = None
    df["MasVnrType"] = df["MasVnrType"].fillna("None")  # Masonry veneer type = None
    df["SaleType"] = df["SaleType"].fillna("Oth")
    df["Electrical"] = df.groupby(['Neighborhood'])['Electrical'].transform(
        lambda x: x.fillna(x.value_counts().index[0])
    )  # Electrical system is filled by the most common value in the neighborhood
    df["MSZoning"] = df.groupby(['Neighborhood'])['MSZoning'].transform(
        lambda x: x.fillna(x.value_counts().index[0])
    )  # MSZoning is filled by the most common value in the neighborhood
    df["Exterior1st"] = df.groupby(['Neighborhood'])['Exterior1st'].transform(
        lambda x: x.fillna(x.value_counts().index[0])
    )  # Exterior1st
    df["Exterior2nd"] = df.groupby(['Neighborhood'])['Exterior2nd'].transform(
        lambda x: x.fillna(x.value_counts().index[0])
    )  # Exterior2nd
    # If the kitchen quality is missing
    df["KitchenQual"] = df["KitchenQual"].fillna("None")
    # Missing bath values <=> no bath
    df["HalfBath"] = df["HalfBath"].fillna(0)
    df["FullBath"] = df["FullBath"].fillna(0)
    # Missing values are likely zero (and "None" for categorical features) for having no basement
    df["BsmtFinSF1"] = df["BsmtFinSF1"].fillna(0)
    df["BsmtFinSF2"] = df["BsmtFinSF2"].fillna(0)
    df["BsmtUnfSF"] = df["BsmtUnfSF"].fillna(0)
    df["TotalBsmtSF"] = df["TotalBsmtSF"].fillna(0)
    df["BsmtHalfBath"] = df["BsmtHalfBath"].fillna(0)
    df["BsmtFullBath"] = df["BsmtFullBath"].fillna(0)
    df["BsmtQual"] = df["BsmtQual"].fillna("None")
    df["BsmtCond"] = df["BsmtCond"].fillna("None")
    df["BsmtExposure"] = df["BsmtExposure"].fillna("None")
    df["BsmtFinType1"] = df["BsmtFinType1"].fillna("None")
    df["BsmtFinType2"] = df["BsmtFinType2"].fillna("None")
    df["Functional"] = df["Functional"].fillna("Typ")  # Functional missing <=> Typical
    # For this categorical feature almost all values are "AllPub", except for one "NoSeWa" and two NAs.
    # Since the only house with "NoSeWa" is in the training set, this feature won't be helpful,
    # so we can safely remove it.
    df.drop(['Utilities'], axis=1, inplace=True)
replaceMissingValues(data)
data.isnull().any().any()
def impute_data(df):
    """
    Impute the missing data: 'fillWithNone' fills some missing values with 'None',
    then 'replaceMissingValues' replaces the remaining ones with appropriate values.
    """
    fillWithNone(df)
    replaceMissingValues(df)
impute_data(train_data)
df = train_data[(train_data.GrLivArea > 4000) & (train_data.SalePrice < 4e5)]
grLivArea_outliers = df.GrLivArea.tolist()
salePrice_outliers = df.SalePrice.tolist()
plt.figure(figsize=(15,5))
sns.regplot('GrLivArea','SalePrice', data=train_data)
plt.title('SalePrice vs GrLivArea')
plt.xlabel('Above grade (ground) living area square feet')
# Annotate outliers
x_text = np.mean(grLivArea_outliers)
y_text = 4e5
for x, y in zip(grLivArea_outliers, salePrice_outliers):
    plt.annotate("Outliers",
                 xy=(x, y), xycoords='data',
                 xytext=(x_text, y_text), textcoords='data',
                 size=15, va="center", ha="center",
                 bbox=dict(boxstyle="round4", fc="w", ec="r", pad=0.6),
                 arrowprops=dict(arrowstyle="simple",
                                 connectionstyle="arc3,rad=-0.2",
                                 relpos=(1., 0.),
                                 fc="r"))
train_data.drop(train_data[train_data.GrLivArea > 4000].index, inplace=True)
plt.figure(figsize=(15,5))
sns.regplot('GrLivArea','SalePrice', data=train_data)
plt.title('SalePrice vs GrLivArea')
plt.xlabel('Above grade (ground) living area square feet');
plt.figure(figsize=(20,5)) # The figure size
sns.countplot(x='Neighborhood', data=train_data)
plt.title('Number of houses per neighborhood') # The plot title
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
plt.figure(figsize=(20,10)) # The figure size
sns.barplot('Neighborhood', 'SalePrice', data=train_data)
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
g = sns.FacetGrid(train_data, col="Neighborhood", col_wrap=5, hue="Neighborhood")
g = g.map(sns.regplot, "GrLivArea", "SalePrice")
train_data['SalePricePerSquareFeet'] = train_data['SalePrice'] / train_data['GrLivArea']
mean_SalePricePerSquareFeet = np.mean(train_data.SalePricePerSquareFeet.tolist())
plt.figure(figsize=(20,10)) # The figure size
sns.barplot('Neighborhood', 'SalePricePerSquareFeet', data=train_data)
plt.axhline(y=mean_SalePricePerSquareFeet, color='r', linestyle='--') # Plot the mean
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
plt.ylabel("Sale price per square foot");
plt.figure(figsize=(20,10)) # The figure size
sns.barplot('Neighborhood', 'OverallQual', data=train_data)
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
plt.figure(figsize=(20,10)) # The figure size
sns.barplot('Neighborhood', 'OverallCond', data=train_data)
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
train_data['AgeAtSell'] = train_data['YrSold'] - train_data['YearBuilt']
plt.figure(figsize=(20,10)) # The figure size
sns.barplot('Neighborhood', 'AgeAtSell', data=train_data)
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
plt.ylabel("Age at sale");
train_data['RemodelingAgeAtSell'] = train_data['YrSold'] - train_data['YearRemodAdd']
plt.figure(figsize=(20,10)) # The figure size
sns.barplot('Neighborhood', 'RemodelingAgeAtSell', data=train_data)
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
plt.ylabel("Remodeling age at sale");
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('OverallQual', 'SalePrice', data=train_data)
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.xlabel('Overall Quality')
plt.subplot(1, 3, 2)
sns.stripplot('OverallQual', 'SalePrice', data=train_data, size = 5, jitter = True)
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.xlabel('Overall Quality')
plt.subplot(1, 3, 3)
sns.boxplot('OverallQual', 'SalePrice', data=train_data, showmeans=True)
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
plt.xlabel('Overall Quality');
train_data['YearBuilt2'] = train_data['YearBuilt'].apply(lambda x: "[" + str(x - x %10) + "," + str(x - x %10 + 10) + "[")
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('YearBuilt2', 'OverallQual', data=train_data, order=sorted(train_data['YearBuilt2'].unique().tolist()))
plt.xticks(rotation=90) # Rotate the x-ticks with 90 degrees
plt.xlabel('Year Built')
plt.ylabel("Overall material quality");
plt.subplot(1, 3, 2)
sns.stripplot('YearBuilt2', 'OverallQual', data=train_data, size = 5, jitter = True, order=sorted(train_data['YearBuilt2'].unique().tolist()))
plt.xticks(rotation=90) # Rotate the x-ticks with 90 degrees
plt.xlabel('Year Built')
plt.ylabel("Overall material quality");
plt.subplot(1, 3, 3)
sns.boxplot('YearBuilt2', 'OverallQual', data=train_data, order=sorted(train_data['YearBuilt2'].unique().tolist()), showmeans=True)
plt.xticks(rotation=90); # Rotate the x-ticks with 90 degrees
plt.xlabel('Year Built')
plt.ylabel("Overall material quality");
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('OverallCond', 'SalePrice', data=train_data)
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.xlabel('Overall condition of the house')
plt.ylabel("Sale price");
plt.subplot(1, 3, 2)
sns.stripplot('OverallCond', 'SalePrice', data=train_data, size = 5, jitter = True)
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.xlabel('Overall condition of the house')
plt.ylabel("Sale price");
plt.subplot(1, 3, 3)
sns.boxplot('OverallCond', 'SalePrice', data=train_data, showmeans=True)
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
plt.xlabel('Overall condition of the house')
plt.ylabel("Sale price");
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('YearBuilt2', 'SalePrice', data=train_data, order=sorted(train_data['YearBuilt2'].unique().tolist()))
plt.xticks(rotation=90) # Rotate the x-ticks with 90 degrees
plt.xlabel('Year Built');
plt.subplot(1, 3, 2)
sns.stripplot('YearBuilt2', 'SalePrice', data=train_data, size = 5, jitter = True, order=sorted(train_data['YearBuilt2'].unique().tolist()))
plt.xticks(rotation=90) # Rotate the x-ticks with 90 degrees
plt.xlabel('Year Built');
plt.subplot(1, 3, 3)
sns.boxplot('YearBuilt2', 'SalePrice', data=train_data, order=sorted(train_data['YearBuilt2'].unique().tolist()), showmeans=True)
plt.xticks(rotation=90); # Rotate the x-ticks with 90 degrees
plt.xlabel('Year Built');
train_data['YearRemodAdd2'] = train_data['YearRemodAdd'].apply(lambda x: "[" + str(x - x %10) + "," + str(x - x %10 + 10) + "[")
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('YearRemodAdd2', 'SalePrice', data=train_data, order=sorted(train_data['YearRemodAdd2'].unique().tolist()))
plt.xticks(rotation=90) # Rotate the x-ticks with 90 degrees
plt.xlabel('Year of remodelling');
plt.subplot(1, 3, 2)
sns.stripplot('YearRemodAdd2', 'SalePrice', data=train_data, size = 5, jitter = True, order=sorted(train_data['YearRemodAdd2'].unique().tolist()))
plt.xticks(rotation=90) # Rotate the x-ticks with 90 degrees
plt.xlabel('Year of remodelling');
plt.subplot(1, 3, 3)
sns.boxplot('YearRemodAdd2', 'SalePrice', data=train_data, order=sorted(train_data['YearRemodAdd2'].unique().tolist()), showmeans=True)
plt.xticks(rotation=90); # Rotate the x-ticks with 90 degrees
plt.xlabel('Year of remodelling');
train_data['IsRemodeled'] = ((train_data['YearRemodAdd'] - train_data['YearBuilt']) > 0) * 1
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('IsRemodeled', 'SalePrice', data=train_data)
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.subplot(1, 3, 2)
sns.stripplot('IsRemodeled', 'SalePrice', data=train_data, size = 5, jitter = True)
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.subplot(1, 3, 3)
sns.boxplot('IsRemodeled', 'SalePrice', data=train_data, showmeans=True)
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
plt.figure(figsize=(20,5)) # The figure size
sns.countplot(x='MSZoning', data=train_data)
plt.title('Number of houses in each zone'); # The plot title
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('MSZoning', 'SalePrice', data=train_data)
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.subplot(1, 3, 2)
sns.stripplot('MSZoning', 'SalePrice', data=train_data, size = 5, jitter = True)
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.subplot(1, 3, 3)
sns.boxplot('MSZoning', 'SalePrice', data=train_data, showmeans=True)
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
plt.figure(figsize=(20,5)) # The figure size
sns.countplot(x='BedroomAbvGr', data=train_data);
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('BedroomAbvGr', 'SalePrice', data=train_data)
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.subplot(1, 3, 2)
sns.stripplot('BedroomAbvGr', 'SalePrice', data=train_data, size = 5, jitter = True)
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.subplot(1, 3, 3)
sns.boxplot('BedroomAbvGr', 'SalePrice', data=train_data, showmeans=True)
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
plt.figure(figsize=(20,5)) # The figure size
sns.countplot(x='KitchenQual', data=train_data, order=['Po','Fa','TA','Gd','Ex'])
plt.xlabel('Kitchen quality');
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('KitchenQual', 'SalePrice', data=train_data, order=['Fa','TA','Gd','Ex'])
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.xlabel('Kitchen quality');
plt.subplot(1, 3, 2)
sns.stripplot('KitchenQual', 'SalePrice', data=train_data, size = 5, jitter = True, order=['Fa','TA','Gd','Ex'])
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.xlabel('Kitchen quality');
plt.subplot(1, 3, 3)
sns.boxplot('KitchenQual', 'SalePrice', data=train_data, showmeans=True, order=['Fa','TA','Gd','Ex'])
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
plt.xlabel('Kitchen quality');
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('KitchenAbvGr', 'SalePrice', data=train_data)
plt.xlabel('Number of kitchens');
plt.subplot(1, 3, 2)
sns.stripplot('KitchenAbvGr', 'SalePrice', data=train_data, size = 5, jitter = True)
plt.xlabel('Number of kitchens');
plt.subplot(1, 3, 3)
sns.boxplot('KitchenAbvGr', 'SalePrice', data=train_data, showmeans=True)
plt.xlabel('Number of kitchens');
train_data[train_data.KitchenAbvGr == 0].KitchenQual
train_data.loc[train_data[train_data.KitchenAbvGr == 0].index[0],'KitchenQual'] = 'None'
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('KitchenQual', 'SalePrice', data=train_data, order=['None', 'Fa','TA','Gd','Ex'])
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.xlabel('Kitchen quality');
plt.subplot(1, 3, 2)
sns.stripplot('KitchenQual', 'SalePrice', data=train_data, size = 5, jitter = True, order=['None', 'Fa','TA','Gd','Ex'])
plt.xticks(rotation=45) # Rotate the x-ticks with 45 degrees
plt.xlabel('Kitchen quality');
plt.subplot(1, 3, 3)
sns.boxplot('KitchenQual', 'SalePrice', data=train_data, showmeans=True, order=['None', 'Fa','TA','Gd','Ex'])
plt.xticks(rotation=45); # Rotate the x-ticks with 45 degrees
plt.xlabel('Kitchen quality');
plt.figure(figsize=(15,5))
sns.swarmplot('PoolArea','SalePrice', data=train_data, hue='PoolQC', hue_order=['None', 'Fa','Gd','Ex'])
plt.xlabel('Pool area in square feet');
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('PoolQC', 'SalePrice', data=train_data, order=['None', 'Fa','Gd','Ex'])
plt.xlabel('Pool Quality');
plt.subplot(1, 3, 2)
sns.stripplot('PoolQC', 'SalePrice', data=train_data, size = 5, jitter = True, order=['None', 'Fa','Gd','Ex'])
plt.xlabel('Pool Quality');
plt.subplot(1, 3, 3)
sns.boxplot('PoolQC', 'SalePrice', data=train_data, showmeans=True, order=['None', 'Fa','Gd','Ex'])
plt.xlabel('Pool Quality');
plt.figure(figsize=(20,5)) # The figure size
sns.countplot(x='Heating', data=train_data)
plt.xlabel('Type of heating');
plt.figure(figsize=(20,5)) # The figure size
sns.countplot(x='HeatingQC', data=train_data, order=['Po','Fa','TA','Gd','Ex'])
plt.xlabel('Heating quality and condition');
# Transform to a count
count = train_data.groupby(['Neighborhood', 'HeatingQC']).HeatingQC.count()
# Re-create a new array of levels
levels = [count.index.levels[0].values, count.index.levels[1].values]
new_index = pd.MultiIndex.from_product(levels, names=count.index.names)
# Reindex the count and fill empty values with zero (NaN by default)
count = count.reindex(new_index, fill_value=0)
# Construct a panda dataframe
df = pd.DataFrame(count).rename(columns={"HeatingQC": "Count"}).reset_index()
g = sns.FacetGrid(df, col="Neighborhood", col_wrap=5, hue="Neighborhood")
g = g.map(sns.barplot, "HeatingQC", "Count")
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('HeatingQC', 'SalePrice', data=train_data, order=['Po','Fa','TA','Gd','Ex'])
plt.xlabel('Heating quality and condition');
plt.subplot(1, 3, 2)
sns.stripplot('HeatingQC', 'SalePrice', data=train_data, size = 5, jitter = True, order=['Po','Fa','TA','Gd','Ex'])
plt.xlabel('Heating quality and condition');
plt.subplot(1, 3, 3)
sns.boxplot('HeatingQC', 'SalePrice', data=train_data, showmeans=True, order=['Po','Fa','TA','Gd','Ex'])
plt.xlabel('Heating quality and condition');
plt.figure(figsize=(20,5)) # The figure size
sns.countplot(x='BsmtQual', data=train_data, order=['None','Po','Fa','TA','Gd','Ex'])
plt.xlabel('The height of the basement');
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('BsmtQual', 'SalePrice', data=train_data, order=['None','Fa','TA','Gd','Ex'])
plt.xlabel('The height of the basement');
plt.subplot(1, 3, 2)
sns.stripplot('BsmtQual', 'SalePrice', data=train_data, size = 5, jitter = True, order=['None','Fa','TA','Gd','Ex'])
plt.xlabel('The height of the basement');
plt.subplot(1, 3, 3)
sns.boxplot('BsmtQual', 'SalePrice', data=train_data, showmeans=True, order=['None','Fa','TA','Gd','Ex'])
plt.xlabel('The height of the basement');
train_data['BsmtFinSF'] = train_data['BsmtFinSF1'] + train_data['BsmtFinSF2']
plt.figure(figsize=(15,5))
sns.regplot(x='BsmtFinSF', y='SalePrice', data=train_data, label="Finished basement square feet")
sns.regplot(x='BsmtUnfSF', y='SalePrice', data=train_data, marker="*", label="Unfinished square feet of basement area")
plt.title('Sales price vs basement square feet')
plt.xlabel('Basement square feet')
plt.legend();
plt.figure(figsize=(15,5))
sns.regplot(x='GarageArea', y='SalePrice', data=train_data)
plt.xlabel('Size of garage in square feet');
train_data['GarageYrBlt2'] = train_data['GarageYrBlt'].astype(int).apply(lambda x: "[" + str(x - x %10) + "," + str(x - x %10 + 10) + "[")
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('GarageYrBlt2', 'SalePrice', data=train_data, order=sorted(train_data['GarageYrBlt2'].unique().tolist()))
plt.xlabel('Year garage was built')
plt.xticks(rotation=90); # Rotate the x-ticks with 90 degrees
plt.subplot(1, 3, 2)
sns.stripplot('GarageYrBlt2', 'SalePrice', data=train_data, size = 5, jitter = True, order=sorted(train_data['GarageYrBlt2'].unique().tolist()))
plt.xlabel('Year garage was built')
plt.xticks(rotation=90); # Rotate the x-ticks with 90 degrees
plt.subplot(1, 3, 3)
sns.boxplot('GarageYrBlt2', 'SalePrice', data=train_data, showmeans=True, order=sorted(train_data['GarageYrBlt2'].unique().tolist()))
plt.xlabel('Year garage was built')
plt.xticks(rotation=90); # Rotate the x-ticks with 90 degrees
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('GarageFinish', 'SalePrice', data=train_data, order=['None','Unf','RFn','Fin'])
plt.xlabel('Interior finish of the garage');
plt.subplot(1, 3, 2)
sns.stripplot('GarageFinish', 'SalePrice', data=train_data, size = 5, jitter = True, order=['None','Unf','RFn','Fin'])
plt.xlabel('Interior finish of the garage');
plt.subplot(1, 3, 3)
sns.boxplot('GarageFinish', 'SalePrice', data=train_data, showmeans=True, order=['None','Unf','RFn','Fin'])
plt.xlabel('Interior finish of the garage');
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('GarageQual', 'SalePrice', data=train_data, order=['None','Po','Fa','TA','Gd','Ex'])
plt.xlabel('Garage quality');
plt.subplot(1, 3, 2)
sns.stripplot('GarageQual', 'SalePrice', data=train_data, size = 5, jitter = True, order=['None','Po','Fa','TA','Gd','Ex'])
plt.xlabel('Garage quality');
plt.subplot(1, 3, 3)
sns.boxplot('GarageQual', 'SalePrice', data=train_data, showmeans=True, order=['None','Po','Fa','TA','Gd','Ex'])
plt.xlabel('Garage quality');
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('ExterQual', 'SalePrice', data=train_data, order=['None','Po','Fa','TA','Gd','Ex'])
plt.xlabel('The quality of the material on the exterior');
plt.subplot(1, 3, 2)
sns.stripplot('ExterQual', 'SalePrice', data=train_data, size = 5, jitter = True, order=['None','Po','Fa','TA','Gd','Ex'])
plt.xlabel('The quality of the material on the exterior');
plt.subplot(1, 3, 3)
sns.boxplot('ExterQual', 'SalePrice', data=train_data, showmeans=True, order=['None','Po','Fa','TA','Gd','Ex'])
plt.xlabel('The quality of the material on the exterior');
train_data["TotalBath"] = (train_data["FullBath"] + train_data["BsmtFullBath"]) + 0.5 * (train_data["HalfBath"] + train_data["BsmtHalfBath"])
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('TotalBath', 'SalePrice', data=train_data)
plt.xlabel('Total bathrooms');
plt.subplot(1, 3, 2)
sns.stripplot('TotalBath', 'SalePrice', data=train_data, size = 5, jitter = True)
plt.xlabel('Total bathrooms');
plt.subplot(1, 3, 3)
sns.boxplot('TotalBath', 'SalePrice', data=train_data, showmeans=True)
plt.xlabel('Total bathrooms');
train_data["FlrSF"] = train_data["1stFlrSF"] + train_data["2ndFlrSF"]
plt.figure(figsize=(15,5))
sns.regplot(x='FlrSF', y='SalePrice', data=train_data)
plt.xlabel('Floor square feet');
plt.figure(figsize=(15,5))
sns.regplot(x='OpenPorchSF', y='SalePrice', data=train_data[train_data.OpenPorchSF > 0])
plt.title('Sale Prices for Open Porches')
plt.xlabel('Porch area in square feet');
plt.figure(figsize=(15,5))
sns.regplot(x='EnclosedPorch', y='SalePrice', data=train_data[train_data.EnclosedPorch > 0], color='g')
plt.title('Sales Price for Enclosed Porches')
plt.xlabel('Porch area in square feet');
plt.figure(figsize=(15,5))
sns.regplot(x='3SsnPorch', y='SalePrice', data=train_data[train_data['3SsnPorch'] > 0], color='r')
plt.title('Sales Price for Three Season Porches')
plt.xlabel('Porch area in square feet');
plt.figure(figsize=(15,5))
sns.regplot(x='ScreenPorch', y='SalePrice', data=train_data[train_data.ScreenPorch > 0], color='b')
plt.title('Sales Price for Screened Porches')
plt.xlabel('Porch area in square feet');
train_data["PorchSF"] = train_data["OpenPorchSF"] + train_data["EnclosedPorch"] + train_data["3SsnPorch"] + train_data["ScreenPorch"]
plt.figure(figsize=(15,5))
sns.regplot(x='PorchSF', y='SalePrice', data=train_data, label='PorchSF', fit_reg=True)
sns.regplot(x='PorchSF', y='SalePrice', data=train_data[train_data.PorchSF > 0], label='PorchSF > 0')
plt.xlabel('Porch area in square feet')
plt.legend();
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('SaleCondition', 'SalePrice', data=train_data)
plt.xlabel('Condition of sale');
plt.subplot(1, 3, 2)
sns.stripplot('SaleCondition', 'SalePrice', data=train_data, size = 5, jitter = True)
plt.xlabel('Condition of sale');
plt.subplot(1, 3, 3)
sns.boxplot('SaleCondition', 'SalePrice', data=train_data, showmeans=True)
plt.xlabel('Condition of sale');
plt.figure(figsize=(15,5))
sns.regplot(x='MasVnrArea', y='SalePrice', data=train_data, label='MasVnrArea')
sns.regplot(x='MasVnrArea', y='SalePrice', data=train_data[train_data.MasVnrArea > 0], label='MasVnrArea > 0')
plt.xlabel('Masonry veneer area in square feet')
plt.legend();
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('MasVnrType', 'SalePrice', data=train_data)
plt.xlabel('Masonry veneer type');
plt.subplot(1, 3, 2)
sns.stripplot('MasVnrType', 'SalePrice', data=train_data, size = 5, jitter = True)
plt.xlabel('Masonry veneer type');
plt.subplot(1, 3, 3)
sns.boxplot('MasVnrType', 'SalePrice', data=train_data, showmeans=True)
plt.xlabel('Masonry veneer type');
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('Street', 'SalePrice', data=train_data)
plt.xlabel('Type of road access to property');
plt.subplot(1, 3, 2)
sns.stripplot('Street', 'SalePrice', data=train_data, size = 5, jitter = True)
plt.xlabel('Type of road access to property');
plt.subplot(1, 3, 3)
sns.boxplot('Street', 'SalePrice', data=train_data, showmeans=True)
plt.xlabel('Type of road access to property');
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('Alley', 'SalePrice', data=train_data)
plt.xlabel('Type of alley access to property');
plt.subplot(1, 3, 2)
sns.stripplot('Alley', 'SalePrice', data=train_data, size = 5, jitter = True)
plt.xlabel('Type of alley access to property');
plt.subplot(1, 3, 3)
sns.boxplot('Alley', 'SalePrice', data=train_data, showmeans=True)
plt.xlabel('Type of alley access to property');
train_data['NoAlley'] = (train_data['Alley'] == 'None') * 1
plt.subplots(figsize =(20, 5))
plt.subplot(1, 3, 1)
sns.barplot('NoAlley', 'SalePrice', data=train_data)
plt.subplot(1, 3, 2)
sns.stripplot('NoAlley', 'SalePrice', data=train_data, size = 5, jitter = True)
plt.subplot(1, 3, 3)
sns.boxplot('NoAlley', 'SalePrice', data=train_data, showmeans=True);
plt.figure(figsize=(20,5)) # The figure size
sns.countplot(x='MoSold', data=train_data)
plt.xlabel('Month');
# Construct a dataframe from the training data csv file
train_data = pd.read_csv(path_data + "train.csv")
# Drop ID
train_data.drop(['Id'], axis=1, inplace=True)
# Clean missing values
impute_data(train_data)
# Remove outliers we found before in our training data
train_data.drop(train_data[train_data.GrLivArea > 4000].index, inplace=True)
train_data.loc[train_data[train_data.KitchenAbvGr == 0].index[0],'KitchenQual'] = 'None'
plt.figure(figsize=(20,5))
sns.distplot(train_data['SalePrice'])
print("Skewness is:", train_data['SalePrice'].skew())
print("Kurtosis is:", train_data['SalePrice'].kurt())
train_data['SalePrice'] = np.log1p(train_data['SalePrice'])
plt.figure(figsize=(20,5))
sns.distplot(train_data['SalePrice'])
print("Skewness is:", train_data['SalePrice'].skew())
print("Kurtosis is:", train_data['SalePrice'].kurt())
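Since the target is now log1p-transformed, model predictions must be mapped back to dollars with the inverse transform before writing a submission; a minimal round-trip check (the price is an arbitrary example value):

```python
import numpy as np

price = 200000.0                  # arbitrary example price
log_price = np.log1p(price)       # forward transform applied to SalePrice above
recovered = np.expm1(log_price)   # inverse transform, used when reporting predictions
print(recovered)
```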
def get_numerical_nominal(df):
    numerical_ordinal_features = df.select_dtypes(include=['number']).columns.tolist()
    nominal_features = df.select_dtypes(exclude=['number']).columns.tolist()
    print("Number of features:", df.shape[1])
    print("Number of numerical (discrete and continuous) and ordinal features:", len(numerical_ordinal_features))
    print("Number of nominal features:", len(nominal_features))
    return numerical_ordinal_features, nominal_features
get_numerical_nominal(train_data);
def replace_nominal(df):
# 1 indicates if the street is paved and 0 if it is not.
df['Street'] = (df['Street'] == 'Pave') * 1
# 2 if the alley is paved, 1 if it is gravel, 0 if there is no alley access.
street_alley_dict = {'None': 0, 'Grvl': 1, 'Pave': 2}
df['Alley'].replace(street_alley_dict, inplace=True)
# (Reg = Regular) - (IR1 = Slightly irregular) - (IR2 = Moderately Irregular) - (IR3 = Irregular)
lotshape_dict = {'IR3' : 1, 'IR2' : 2, 'IR1' : 3, 'Reg' : 4}
df['LotShape'].replace(lotshape_dict, inplace=True)
# (Gtl = Gentle slope) - (Mod = Moderate Slope) - (Sev = Severe Slope)
landslope_dict = {'Sev' : 1, 'Mod' : 2, 'Gtl' : 3}
df['LandSlope'].replace(landslope_dict, inplace=True)
# (Ex = Excellent) - (Gd = Good) - (TA = Average/Typical) - (Fa = Fair )- (Po = Poor)
qual_dict = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
qual_columns = [
'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
'GarageCond', 'HeatingQC', 'KitchenQual', 'FireplaceQu',
'GarageQual', 'PoolQC'
]
df[qual_columns] = df[qual_columns].replace(qual_dict)
# Good > Average > Minimum > No exposure
exposure_dict = {'None': 0, 'No': 0, 'Mn': 1, 'Av': 2, 'Gd': 3}
df['BsmtExposure'].replace(exposure_dict, inplace=True)
# (GLQ = Good Living Quarters) - (ALQ = Average Living Quarters) - (BLQ = Below Average Living Quarters)
# (Rec = Average Rec Room) - (LwQ = Low Quality) - (Unf = Unfinshed) - (None = No Basement)
bsmtfintype_dict = {'None': 0, 'No': 0, 'Unf' : 1, 'LwQ': 2,
'Rec' : 3, 'BLQ' : 4, 'ALQ' : 5, 'GLQ' : 6}
df[['BsmtFinType1', 'BsmtFinType2']] = df[['BsmtFinType1', 'BsmtFinType2']].replace(bsmtfintype_dict)
bool_dict = {'N': 0, 'Y': 1}
df['CentralAir'].replace(bool_dict, inplace=True)
# (Fin = Finished) - (RFn = Rough Finished) - (Unf = Unfinished)
finished_dict = {'None': -0.5, 'Unf': 0., 'RFn': 0.5, 'Fin': 1.}
df['GarageFinish'].replace(finished_dict, inplace=True)
paved_dict = {'N': 0., 'P': 0.5, 'Y': 1.}
df['PavedDrive'].replace(paved_dict, inplace=True)
functional_dict = {'Sal' : 1, 'Sev' : 2, 'Maj2' : 3, 'Maj1' : 4,
'Mod': 5, 'Min2' : 6, 'Min1' : 7, 'Typ' : 8}
df['Functional'].replace(functional_dict, inplace=True)
replace_nominal(train_data)
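The label-to-rank pattern used throughout `replace_nominal` can be checked in isolation on a hypothetical toy column (the column name is illustrative, not from the dataset):

```python
import pandas as pd

qual_dict = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
toy = pd.DataFrame({'Quality': ['Gd', 'TA', 'Ex', 'Po']})

# Series.replace maps each quality label to its ordinal rank
toy['Quality'] = toy['Quality'].replace(qual_dict)
print(toy['Quality'].tolist())  # [4, 3, 5, 1]
```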
def replace_ordinal(df):
MSSubClass_dict = {
20 : "MSSC20", 30 : "MSSC30", 40 : "MSSC40", 45 : "MSSC45",
50 : "MSSC50", 60 : "MSSC60", 70 : "MSSC70", 75 : "MSSC75",
80 : "MSSC80", 85 : "MSSC85", 90 : "MSSC90", 120 : "MSSC120",
150 : "MSSC150", 160 : "MSSC160", 180 : "MSSC180", 190 : "MSSC190"
}
df['MSSubClass'].replace(MSSubClass_dict, inplace=True)
month_dict = {
1 : "Jan", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5 : "May", 6 : "Jun",
7 : "Jul", 8 : "Aug", 9 : "Sep", 10 : "Oct", 11 : "Nov", 12 : "Dec"
}
df['MoSold'].replace(month_dict, inplace=True)
replace_ordinal(train_data)
def add_features(df):
df['AgeAtSell'] = np.abs(df['YrSold'] - df['YearBuilt'])
df['RemodelingAgeAtSell'] = np.abs(df['YrSold'] - df['YearRemodAdd'])
df['GarageAgeAtSell'] = np.abs(df['YrSold'] - df['GarageYrBlt'])
df.drop(['YrSold', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt'], axis=1, inplace=True)
df["TotalBath"] = (df["FullBath"] + df["BsmtFullBath"]) + 0.5 * (df["HalfBath"] + df["BsmtHalfBath"])
df.drop(["FullBath", "BsmtFullBath", "HalfBath", "BsmtHalfBath"], axis=1, inplace=True)
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
df.drop(["TotalBsmtSF", "1stFlrSF", "2ndFlrSF"], axis=1, inplace=True)
df["PorchSF"] = df["OpenPorchSF"] + df["EnclosedPorch"] + df["3SsnPorch"] + df["ScreenPorch"]
df.drop(["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"], axis=1, inplace=True)
df["BsmtFinSF"] = df["BsmtFinSF1"] + df["BsmtFinSF2"]
df.drop(["BsmtFinSF1", "BsmtFinSF2"], axis=1, inplace=True)
add_features(train_data)
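The bathroom aggregation in `add_features` weights half baths at 0.5; on a hypothetical row with 2 full baths, 1 basement full bath, 1 half bath and no basement half bath:

```python
# Hypothetical counts for a single house
full_bath, bsmt_full_bath = 2, 1
half_bath, bsmt_half_bath = 1, 0

total_bath = (full_bath + bsmt_full_bath) + 0.5 * (half_bath + bsmt_half_bath)
print(total_bath)  # 3.5
```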
# Sub-categories
numerical_ordinal_features, nominal_features = get_numerical_nominal(train_data)
ordinal_features = ['Street', 'Alley', 'LotShape', 'LandSlope', 'OverallQual', 'OverallCond',
'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
'BsmtFinType2', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'FireplaceQu',
'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'AgeAtSell',
'RemodelingAgeAtSell', 'GarageAgeAtSell'
]
# Big categories
categorical_features = ordinal_features + nominal_features
numerical_features = list(set(numerical_ordinal_features).difference(set(ordinal_features)))
print("Number of categorical features:", len(categorical_features))
print("Number of numerical features:", len(numerical_features))
skewed_feats = train_data[numerical_features].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewed_feats.plot(kind='barh', figsize=(15, 7));
We're going to transform features whose absolute skewness exceeds $0.5$ using the Box-Cox transformation:
${\displaystyle y_{i}^{(\lambda )}={\begin{cases}{\dfrac {y_{i}^{\lambda }-1}{\lambda }}&{\text{if }}\lambda \neq 0,\\[8pt]\ln {(y_{i})}&{\text{if }}\lambda =0,\end{cases}}}$
skewness = skewed_feats[abs(skewed_feats) > 0.5]
lam = 0.15
if (apply_boxcox):
train_data[skewness.index] = boxcox1p(train_data[skewness.index], lam)
print(skewness.shape[0], "skewed numerical features have been Box-Cox transformed")
else:
print("Box-Cox disabled")
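`scipy.special.boxcox1p` implements the formula above applied to $1 + y$; a quick sanity check with the `lam = 0.15` used here:

```python
import numpy as np
from scipy.special import boxcox1p

lam = 0.15
y = np.array([0.0, 10.0, 100.0])

# For lam != 0: boxcox1p(y, lam) == ((1 + y)**lam - 1) / lam
manual = ((1.0 + y) ** lam - 1.0) / lam
print(np.allclose(boxcox1p(y, lam), manual))  # True
# For lam == 0 it falls back to log(1 + y)
print(np.allclose(boxcox1p(y, 0.0), np.log1p(y)))  # True
```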
df = train_data[numerical_features] # Our training data with only the numerical data
method = 'pearson'
df.corr(method=method).style.format("{:.2}").background_gradient(cmap=plt.get_cmap('coolwarm'), axis=1)
corr_pearson = df.corr(method)
mask = np.zeros_like(corr_pearson)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(15,15))
with sns.axes_style("white"):
ax0 = sns.heatmap(corr_pearson, mask=mask, vmin=-1., vmax=1.,
square=True, cmap=plt.get_cmap('coolwarm'),
linewidths=0.3, annot=True, cbar=False)
for text in ax0.texts:
t = float(text.get_text())
if (abs(t) < 0.8):
text.set_text('')
else:
text.set_text(round(t, 4))
def R2_print(corr, feature1, feature2):
corr2 = corr[feature1][feature2]
print("- A correlation of", round(corr2, 4), "between", feature1, "and", feature2, "implies that",
100 * round(corr2 ** 2, 4), "% of the variation in", feature2, "can be explained by", feature1, "and vice-versa.")
print("--- The correlation between", feature1, "and the sale price is:", round(corr[feature1]["SalePrice"], 4), ".")
print("--- The correlation between", feature2, "and the sale price is:", round(corr[feature2]["SalePrice"], 4), ".\n")
R2_print(corr_pearson, 'GrLivArea','TotalSF')
R2_print(corr_pearson, 'TotRmsAbvGrd','GrLivArea')
R2_print(corr_pearson, 'GarageArea','GarageCars')
#Additional
R2_print(corr_pearson, 'TotRmsAbvGrd','TotalSF')
if (drop_highly_correlated):
columns_to_remove = ['GrLivArea', 'GarageArea']
train_data.drop(columns_to_remove, axis=1, inplace=True)
for x in columns_to_remove:
numerical_features.remove(x)
numerical_ordinal_features.remove(x)
else:
print("Dropping highly correlated features is disabled")
df = train_data[numerical_ordinal_features] # Our training data with only the numerical and ordinal data
method='spearman'
df.corr(method=method).style.format("{:.2}").background_gradient(cmap=plt.get_cmap('coolwarm'), axis=1)
corr = df.corr(method)
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(25,25))
with sns.axes_style("white"):
ax0 = sns.heatmap(corr, mask=mask, vmin=-1., vmax=1.,
square=True, cmap=plt.get_cmap('coolwarm'),
linewidths=0.2, annot=True, cbar=False)
for text in ax0.texts:
t = float(text.get_text())
if (abs(t) < 0.8):
text.set_text('')
else:
text.set_text(round(t, 4))
features = df.columns.tolist()
features.remove('SalePrice')
df2 = pd.DataFrame(
list(map(lambda feature: df[feature].corr(df['SalePrice'], method), features)),
index=features, columns=['Correlation'])
df2 = df2.sort_values('Correlation', ascending=True)
# Let's plot the ranking of the features
sns.factorplot(y="Correlation", x="index", data = df2.reset_index(), kind="bar",
size=8, aspect=1.9, palette='coolwarm')
plt.axhline(y=-0.5, color='g', linestyle='--')
plt.axhline(y=0.5, color='g', linestyle='--')
plt.xlabel('Features')
plt.xticks(rotation=90); # Rotate the x-ticks with 90 degrees
if (drop_highly_correlated):
columns_to_remove = ['Fireplaces', 'PoolArea']
train_data.drop(columns_to_remove, axis=1, inplace=True)
for x in columns_to_remove:
numerical_features.remove(x)
numerical_ordinal_features.remove(x)
columns_to_remove = ['GarageCond']
train_data.drop(columns_to_remove, axis=1, inplace=True)
for x in columns_to_remove:
ordinal_features.remove(x)
numerical_ordinal_features.remove(x)
categorical_features.remove(x)
else:
print("Dropping highly correlated features is disabled")
features_highly_correlated = df2[abs(df2) > 0.5].dropna().index.tolist()
features_highly_correlated = list(set(features_highly_correlated).intersection(set(numerical_ordinal_features)))
print(features_highly_correlated)
for a in features_highly_correlated:
if a in numerical_features:
R2_print(corr_pearson, a,'SalePrice')
if (apply_featurizing):
features_to_featurize = [
'AgeAtSell', 'RemodelingAgeAtSell', 'GarageFinish','ExterQual',
'KitchenQual', 'OverallQual', 'FireplaceQu', 'BsmtQual',
'TotalBath', 'TotRmsAbvGrd', 'GarageCars'
]
def featurize(df):
for feature in features_to_featurize:
df[feature + '-2'] = np.sign(df[feature]) * (df[feature] ** 2)
return df
train_data = featurize(train_data)
for feature in features_to_featurize:
numerical_ordinal_features.append(feature + '-2')
if feature in numerical_features:
numerical_features.append(feature + '-2')
else:
ordinal_features.append(feature + '-2')
categorical_features.append(feature + '-2')
else:
print("Featurizing is disabled")
def anova(frame):
anv = pd.DataFrame()
anv['feature'] = categorical_features
pvals = []
for c in categorical_features:
samples = []
for cls in frame[c].unique():
s = frame[frame[c] == cls]['SalePrice'].values
samples.append(s)
pval = f_oneway(*samples)[1]
pvals.append(pval)
anv['pval'] = pvals
return anv.sort_values('pval')
a = anova(train_data)
# Clamp vanishing p-values to avoid -inf when taking the log below
a.loc[a[a.feature == 'OverallQual'].index,'pval'] = 1e-300
a.loc[a[a.feature == 'OverallQual-2'].index,'pval'] = 1e-300
a['Disparity'] = - np.log(a['pval'].values)
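`f_oneway` returns the pair (F statistic, p-value), and the `[1]` index above picks out the p-value. A toy two-group check (both groups are hypothetical):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.RandomState(0)
group_a = rng.normal(100, 10, size=50)  # prices for one category level
group_b = rng.normal(140, 10, size=50)  # a clearly different level

pval = f_oneway(group_a, group_b)[1]
print(pval < 0.001)  # True: the group means differ
```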
# Let's plot the ranking of the features
sns.factorplot(x='feature', y='Disparity', data = a, kind="bar",
size=8, aspect=1.9, palette='coolwarm')
plt.xlabel('Features')
plt.xticks(rotation=90); # Rotate the x-ticks with 90 degrees
def RMSE_error(model, X, y, cv=20):
# Work out the RMSE via cv-fold cross validation
# Pass the KFold object itself: get_n_splits returns a plain int,
# which would silently drop the shuffle and random_state settings
kf = KFold(n_splits=cv, shuffle=True, random_state=RANDOM_STATE)
return np.sqrt(-cross_val_score(model, X, y, cv=kf, scoring = 'neg_mean_squared_error').mean())
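`RMSE_error` relies on sklearn's scorer sign convention: `neg_mean_squared_error` returns the *negative* MSE, hence the minus sign under the square root. A minimal check on a hypothetical linear toy problem:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X_toy = rng.rand(200, 3)
y_toy = X_toy @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_toy, y_toy,
                         cv=kf, scoring='neg_mean_squared_error')
rmse = np.sqrt(-scores.mean())
print("RMSE: %.3f" % rmse)  # close to the 0.1 noise level
```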
ranks_categorical = {} # Dictionary of RMSE scores
Y_train = train_data.SalePrice
for i in range(len(a)):
columns_iter = a[:i].feature.tolist()
nominal_iter = list(set(columns_iter).intersection(nominal_features))
ordinal_iter = list(set(columns_iter).intersection(ordinal_features))
if (nominal_iter):
df_nominal = pd.get_dummies(train_data[nominal_iter], drop_first=True)
else:
df_nominal = pd.DataFrame()
train_sub = pd.concat([train_data[numerical_features], train_data[ordinal_iter], df_nominal], axis=1).drop('SalePrice', axis=1)
X_train_sub = train_sub.values # The input matrix (as_matrix was removed from pandas)
# Create linear regression object
regr = make_pipeline(RobustScaler(), Lasso(alpha = 0.001, random_state=RANDOM_STATE))
ranks_categorical[i] = RMSE_error(regr, X_train_sub, Y_train)
pd.DataFrame(pd.Series(ranks_categorical), columns=['Score'])
best_categorical_features = a[:52].feature.tolist()
print("Best categorical features:\n", " - ".join(best_categorical_features))
best_nominal_features = list(set(nominal_features).intersection(best_categorical_features))
best_ordinal_features = list(set(best_categorical_features).intersection(ordinal_features))
numerical_ordinal_features = numerical_features + best_ordinal_features
df_nominal = pd.get_dummies(train_data[best_nominal_features], drop_first=True)
train_data = pd.concat([train_data[numerical_ordinal_features], df_nominal], axis=1)
best_nominal_features = df_nominal.columns.tolist()
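`pd.get_dummies(..., drop_first=True)` drops one level per feature to avoid perfectly collinear dummy columns; on a hypothetical three-level column:

```python
import pandas as pd

toy = pd.DataFrame({'Roof': ['Gable', 'Hip', 'Flat', 'Gable']})
dummies = pd.get_dummies(toy, drop_first=True)

# The first level in sorted order ('Flat') becomes the implicit baseline
print(dummies.columns.tolist())  # ['Roof_Gable', 'Roof_Hip']
```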
ranks = {} # Dictionary of RMSE scores
best_features_dic = {} # Best features selected by each method
Y_train = train_data.SalePrice.values # The sale price representing the target vector
df = pd.DataFrame(train_data[numerical_ordinal_features]).drop('SalePrice', axis=1) # Our training data with only the numerical and ordinal data
#df = train_data[numerical_ordinal_features].drop('SalePrice', axis=1) # Our training data with only the numerical and ordinal data
X_train = df.values # The input matrix
colnames = df.columns # The feature names
def ranking(ranks, names, order=1):
minmax = MinMaxScaler()
ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
ranks = map(lambda x: round(x,2), ranks)
return pd.DataFrame(pd.Series(dict(zip(names, ranks))), columns=['Score']).sort_values('Score', ascending=False)
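`ranking` squeezes the raw scores into [0, 1] with `MinMaxScaler`; passing `order=-1` flips the scale so that the smallest raw value (e.g. the best RFE rank) scores 1.0. A quick check:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

raw_ranks = np.array([3.0, 1.0, 2.0])  # 1 = best, as in RFE
scaled = MinMaxScaler().fit_transform((-1 * raw_ranks).reshape(-1, 1)).ravel()

# The best (smallest) rank maps to 1.0, the worst to 0.0
print(scaled.round(2).tolist())  # [0.0, 1.0, 0.5]
```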
def KFold_predict(model, X, y, cv=5):
# Cross-validated predictions over cv folds
# Pass the KFold object itself: get_n_splits returns a plain int,
# which would silently drop the shuffle and random_state settings
kf = KFold(n_splits=cv, shuffle=True, random_state=RANDOM_STATE)
return cross_val_predict(model, X, y, cv=kf)
def perform_regression(method_name):
print("---- Performing linear regression using features of the method:", method_name, "----")
train_sub = train_data[list(best_features_dic[method_name]) + best_nominal_features]
X_train_sub = train_sub.values # The input matrix (as_matrix was removed from pandas)
# Create linear regression object
regr = make_pipeline(RobustScaler(), Lasso(alpha = 0.001, random_state=RANDOM_STATE))
# Make predictions using the testing set
Y_pred = KFold_predict(regr, X_train_sub, Y_train)
ranks[method_name] = RMSE_error(regr, X_train_sub, Y_train)
print("RMSE: %.4f" % ranks[method_name])
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(Y_train, Y_pred))
best_features_dic["All features"] = colnames
perform_regression("All features")
select_feature = SelectKBest(f_regression).fit(X_train, Y_train)
best_df = ranking(np.abs(select_feature.scores_), colnames)
best_df = best_df[best_df.Score > 0]
best_features_dic["KBest"] = best_df.index.tolist()
print('Score list:')
best_df
perform_regression("KBest")
rf = RandomForestRegressor(n_jobs=-1, n_estimators=50, random_state=RANDOM_STATE)
rf.fit(X_train, Y_train)
best_df2 = ranking(rf.feature_importances_, colnames)
best_df2 = best_df2[best_df2.Score > 0]
best_features_dic["Random Forest"] = best_df2.index.tolist()
print('Score list:')
best_df2
perform_regression("Random Forest")
# Use linear regression as the model
lr = Lasso(alpha = 0.001, random_state=RANDOM_STATE)
# Rank all features, i.e. continue the elimination until only one is left
rfe = RFE(estimator=lr, n_features_to_select=1)
rfe.fit(X_train, Y_train)
best_df3 = ranking(list(map(float, rfe.ranking_)), colnames, order=-1)
best_df3 = best_df3[best_df3.Score > 0]
best_features_dic["RFE"] = best_df3.index.tolist()
print('Score list:')
best_df3
perform_regression("RFE")
model_xgb = xgb.XGBRegressor(random_state=RANDOM_STATE)
model_xgb.fit(X_train, Y_train)
best_df4 = ranking(model_xgb.feature_importances_, colnames)
best_df4 = best_df4[best_df4.Score > 0]
best_features_dic["XGBoost"] = best_df4.index.tolist()
print('Score list:')
best_df4
perform_regression("XGBoost")
ridge = Ridge(alpha = 0.0006, random_state=RANDOM_STATE)
ridge.fit(X_train, Y_train)
best_df6 = ranking(np.abs(ridge.coef_), colnames)
best_df6 = best_df6[best_df6.Score > 0]
best_features_dic["Ridge"] = best_df6.index.tolist()
print('Score list:')
best_df6
perform_regression("Ridge")
df_ranks = pd.DataFrame(pd.Series(ranks), columns=['RMSE'])
df_ranks = df_ranks.sort_values('RMSE', ascending=True)
df_ranks
# Let's plot the ranking of the features
sns.factorplot(x="RMSE", y="index", data = df_ranks.reset_index(), kind="bar",
size=8, aspect=1.9, palette='coolwarm', orient="h");
best_numerical_ordinal_features = best_features_dic[df_ranks.reset_index().loc[0]['index']]
print(best_numerical_ordinal_features)
def prepare_data(df):
# Impute the missing values
impute_data(df)
# Drop the Id column because it's useless
df.drop(['Id'], axis=1, inplace=True)
# Skewness of target variable
df['SalePrice'] = np.log1p(df['SalePrice'])
# Replace nominal features
replace_nominal(df)
# Recode numeric features that are actually nominal (MSSubClass, MoSold)
replace_ordinal(df)
# Add the new features
add_features(df)
# skewed features
if (apply_boxcox):
df[skewness.index] = boxcox1p(df[skewness.index], lam)
# Drop the columns that are highly correlated
if (drop_highly_correlated):
df.drop([
'GrLivArea', 'GarageArea', 'Fireplaces', 'PoolArea', 'GarageCond'
], axis=1, inplace=True)
if (apply_featurizing):
df = featurize(df)
#best_nominal_features = list(set(nominal_features).intersection(best_categorical_features))
df_nominal = pd.get_dummies(df[list(set(nominal_features).intersection(best_categorical_features))], drop_first=True)
df_columns = df_nominal.columns.tolist()
# Add the encoded columns to the test data in case the number of new features are not the same
for x in best_nominal_features:
if (x not in df_columns):
df_nominal[x] = np.zeros(df[x.split("_")[0]].shape)
# Train data doesn't contain RoofMatl_CompShg
if ("RoofMatl_CompShg" not in df_nominal):
df_nominal["RoofMatl_CompShg"] = np.zeros(df["RoofMatl"].shape)
X = pd.concat([df[best_numerical_ordinal_features], df_nominal], axis=1).values # The input vector containing only the best features
y = df.SalePrice.values # The sale price representing the target vector
return X, y
# Construct a dataframe from the training data csv file
train_data = pd.read_csv(path_data + "train.csv")
# Remove outliers we found before in our training data
train_data.drop(train_data[train_data.GrLivArea > 4000].index, inplace=True)
train_data.loc[train_data[train_data.KitchenAbvGr == 0].index[0],'KitchenQual'] = 'None'
# Construct a clean version of the training data ready to be used in our machine learning models
X, y = prepare_data(train_data)
# Linear regression
print("---- Fitting Linear regression ----\n")
t0 = time()
regr = make_pipeline(RobustScaler(), LinearRegression())
# Cross-validated predictions and RMSE
y_pred = KFold_predict(regr, X, y)
print("RMSE: %.4f" % RMSE_error(regr, X, y))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred))
print("\nTime spent: %0.3fs \n" % (time() - t0))
# Ridge regression
print("---- Fitting Ridge regression ----\n")
t0 = time()
ridge = RidgeCV(alphas =
[0.001, 0.003,
0.006, 0.009, 0.01, 0.03, 0.06, 0.09, 0.1, 0.3,
0.6, 0.9, 1, 3, 6, 10, 30, 60
])
regr_ridge = make_pipeline(RobustScaler(), ridge)
regr_ridge.fit(X, y)
alpha = ridge.alpha_
print("-- 1st step -- \n\t Best alpha %.6f" %alpha)
ridge = RidgeCV(alphas = [
alpha, alpha * .06, alpha * .065, alpha * .07, alpha * .075, alpha * .08, alpha * .085,
alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85,
alpha * .9, alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15,
alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4
])
regr_ridge = make_pipeline(RobustScaler(), ridge)
regr_ridge.fit(X, y)
alpha = ridge.alpha_
print("-- 2nd step -- \n\t Best alpha %.6f\n" %alpha)
# Cross-validated predictions and RMSE
y_pred = KFold_predict(regr_ridge, X, y)
print("RMSE: %.4f" % RMSE_error(regr_ridge, X, y))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred))
print("\nTime spent: %0.3fs \n" % (time() - t0))
# Lasso regression
print("---- Fitting Lasso regression ----\n")
t0 = time()
lasso = LassoCV(alphas =
[0.001, 0.003,
0.006, 0.009, 0.01, 0.03, 0.06, 0.09, 0.1, 0.3,
0.6, 0.9, 1, 3, 6, 10, 30, 60
], random_state=RANDOM_STATE)
regr_lasso = make_pipeline(RobustScaler(), lasso)
regr_lasso.fit(X, y)
alpha = lasso.alpha_
print("-- 1st step -- \n\t Best alpha %.6f" %alpha)
lasso = LassoCV(alphas = [
alpha, alpha * .06, alpha * .065, alpha * .07, alpha * .075, alpha * .08, alpha * .085,
alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85,
alpha * .9, alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15,
alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4
], random_state=RANDOM_STATE)
regr_lasso = make_pipeline(RobustScaler(), lasso)
regr_lasso.fit(X, y)
alpha = lasso.alpha_
print("-- 2nd step -- \n\t Best alpha %.6f\n" %alpha)
# Cross-validated predictions and RMSE
y_pred = KFold_predict(regr_lasso, X, y)
print("RMSE: %.4f" % RMSE_error(regr_lasso, X, y))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred))
print("\nTime spent: %0.3fs \n" % (time() - t0))
# Elastic net
print("---- Fitting Elastic Net ----\n")
t0 = time()
elasticn = ElasticNetCV(alphas=[
0.0001, 0.0003, 0.0006, 0.0009,
0.001, 0.003, 0.006, 0.009,
0.01, 0.03, 0.06, 0.09,
0.1,0.3, 0.6, 0.9,
1, 3, 6
], l1_ratio=[
0.01, 0.03, 0.05, 0.06, 0.07, 0.08, 0.085, 0.09, 0.095,
0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1], random_state=RANDOM_STATE)
regr_elastic = make_pipeline(RobustScaler(), elasticn)
regr_elastic.fit(X, y)
# Best parameter
alpha = elasticn.alpha_
l1_ratio = elasticn.l1_ratio_
print("-- 1st step -- \n\t Best alpha %.6f and best l1_ratio %.6f" %(alpha, l1_ratio))
elasticn = ElasticNetCV(alphas=[
alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85,
alpha * .9, alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15,
alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4
], l1_ratio=[
l1_ratio * .85, l1_ratio * .9, l1_ratio * .95,
l1_ratio, l1_ratio * 1.05, l1_ratio * 1.1, l1_ratio * 1.15
], random_state=RANDOM_STATE)
regr_elastic = make_pipeline(RobustScaler(), elasticn)
regr_elastic.fit(X, y)
# Best parameter
alpha = elasticn.alpha_
l1_ratio = elasticn.l1_ratio_
print("-- 2nd step -- \n\t Best alpha %.6f and best l1_ratio %.6f\n" %(alpha, l1_ratio))
# Cross-validated predictions and RMSE
y_pred = KFold_predict(regr_elastic, X, y)
print("RMSE: %.4f" % RMSE_error(regr_elastic, X, y))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred))
print("\nTime spent: %0.3fs \n" % (time() - t0))
# Kernel Ridge Regression: grid search for alpha and gamma
print("---- Looking for best parameters for Kernel Ridge Regression ----\n")
gscv = GridSearchCV(
KernelRidge(kernel='polynomial', degree=1, coef0=100), cv = 10,
scoring = 'neg_mean_squared_error',
param_grid = {
"alpha": [
0.01, 0.03, 0.06, 0.09,
0.1, 0.3, 0.6, 0.9,
1, 3, 6
],
"gamma": [
0.01, 0.03, 0.06, 0.09,
0.1, 0.3, 0.6, 0.9,
1, 3, 6
]
}
)
gscv.fit(X, y)
print("1st phase\n\tBest parameters:", gscv.best_params_)
alpha = gscv.best_params_["alpha"]
gamma = gscv.best_params_["gamma"]
gscv = GridSearchCV(
KernelRidge(kernel='polynomial', degree=1, coef0=100), cv = 10,
scoring = 'neg_mean_squared_error',
param_grid = {
"alpha": [
alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85,
alpha * .9, alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15,
alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4
],
"gamma": [
gamma * .6, gamma * .65, gamma * .7, gamma * .75, gamma * .8, gamma * .85,
gamma * .9, gamma * .95, gamma, gamma * 1.05, gamma * 1.1, gamma * 1.15,
gamma * 1.25, gamma * 1.3, gamma * 1.35, gamma * 1.4
]
}
)
gscv.fit(X, y)
print("2nd phase\n\tbest parameters:", gscv.best_params_)
alpha = gscv.best_params_["alpha"]
gamma = gscv.best_params_["gamma"]
# Kernel Ridge Regression
print("---- Fitting Kernel Ridge Regression ----\n")
t0 = time()
kernel = make_pipeline(RobustScaler(), KernelRidge(alpha=alpha, coef0=100, degree=1, gamma=gamma, kernel='polynomial'))
kernel.fit(X, y)
# Cross-validated predictions and RMSE
y_pred = KFold_predict(kernel, X, y)
print("RMSE: %.4f" % RMSE_error(kernel, X, y))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred))
print("\nTime spent: %0.3fs \n" % (time() - t0))
# Bayesian Ridge
print("---- Fitting Bayesian Ridge ----\n")
t0 = time()
bay = make_pipeline(RobustScaler(),
BayesianRidge(tol=1e-5, alpha_1=1e-8, alpha_2=5e-6, lambda_1=5e-6, lambda_2=1e-8)
)
bay.fit(X, y)
# Cross-validated predictions and RMSE
y_pred = KFold_predict(bay, X, y)
print("RMSE: %.4f" % RMSE_error(bay, X, y))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred))
print("\nTime spent: %0.3fs \n" % (time() - t0))
# Gradient Boosting Regressor
print("---- Fitting Gradient Boosting Regression to the training set ----\n")
t0 = time()
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.02,
max_depth=4, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state=RANDOM_STATE)
GBoost.fit(X, y)
# Cross-validated predictions and RMSE
y_pred = KFold_predict(GBoost, X, y)
print("RMSE: %.4f" % RMSE_error(GBoost, X, y))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred))
print("\nTime spent: %0.3fs \n" % (time() - t0))
# A second, lighter Gradient Boosting Regressor (nicknamed LGBoost here; not LightGBM)
print("---- Fitting the lighter Gradient Boosting Regressor to the training set ----\n")
t0 = time()
LGBoost = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.05,
max_depth=2, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state=RANDOM_STATE)
LGBoost.fit(X, y)
# Cross-validated predictions and RMSE
y_pred = KFold_predict(LGBoost, X, y)
print("RMSE: %.4f" % RMSE_error(LGBoost, X, y))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred))
print("\nTime spent: %0.3fs \n" % (time() - t0))
#plot_residuals_and_predictions(y, y_pred)
# XGBooster
print("---- Fitting XGBooster to the training set ----\n")
t0 = time()
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, learning_rate=0.1,
max_depth=2, min_child_weight=1.7817,
n_estimators=360, reg_alpha=0.4640,
subsample=0.5213, silent=1,
random_state = RANDOM_STATE, nthread = -1)
model_xgb.fit(X, y)
# Cross-validated predictions and RMSE
y_pred_xgb = KFold_predict(model_xgb, X, y)
print("RMSE: %.4f" % RMSE_error(model_xgb, X, y))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred_xgb))
print("\nTime spent: %0.3fs \n" % (time() - t0))
import lightgbm as lgb
# LightGBM regressor
print("---- Fitting LGBMRegressor to the training set ----\n")
t0 = time()
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
learning_rate=0.05, n_estimators=720,
max_bin = 55, bagging_fraction = 0.8,
bagging_freq = 5, feature_fraction = 0.2319,
feature_fraction_seed=RANDOM_STATE, bagging_seed=RANDOM_STATE,
min_data_in_leaf =6, min_sum_hessian_in_leaf = 11, random_state=RANDOM_STATE)
model_lgb.fit(X, y)
# Cross-validated predictions and RMSE
y_pred_lgb = KFold_predict(model_lgb, X, y)
print("RMSE: %.4f" % RMSE_error(model_lgb, X, y))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred_lgb))
print("\nTime spent: %0.3fs \n" % (time() - t0))
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
def __init__(self, models):
self.models = models
# Cloning the original models to fit the data
def fit(self, X, y):
self.models_ = [clone(x) for x in self.models]
# Train cloned base models
for model in self.models_:
model.fit(X, y)
return self
# predicting based on the cloned models
def predict(self, X):
predictions = np.column_stack([
model.predict(X) for model in self.models_])
#averaging the predictions
return np.log1p(np.mean(np.expm1(predictions), axis=1))
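The `expm1`/`log1p` pair in `predict` averages the base models in the original price space rather than in log space; by Jensen's inequality the two differ (arithmetic vs geometric mean), which a tiny numeric check makes concrete:

```python
import numpy as np

# Two hypothetical log-price predictions for one house
log_preds = np.array([[np.log1p(100_000.0), np.log1p(200_000.0)]])

price_space = np.log1p(np.mean(np.expm1(log_preds), axis=1))  # average prices, then re-log
log_space = np.mean(log_preds, axis=1)                        # average the logs directly

print(round(np.expm1(price_space)[0], 2))  # 150000.0, the arithmetic mean of the prices
print(round(np.expm1(log_space)[0]))       # ~141421, close to the geometric mean
```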
# Average modeling
print("---- Fitting Average modeling to the training set ----\n")
t0 = time()
averaged_models = AveragingModels(models = (regr_elastic, regr_lasso, regr_ridge, bay, GBoost, LGBoost, kernel))
averaged_models.fit(X, y)
# Cross-validated predictions and RMSE
y_pred = KFold_predict(averaged_models, X, y)
print("RMSE: %.4f" % RMSE_error(averaged_models, X, y))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred))
print("\nTime spent: %0.3fs \n" % (time() - t0))

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
def __init__(self, base_models, meta_model, n_folds=5):
self.base_models = base_models
self.meta_model = meta_model
self.n_folds = n_folds
# We again fit the data on clones of the original models
def fit(self, X, y):
self.base_models_ = [list() for x in self.base_models]
self.meta_model_ = clone(self.meta_model)
kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=RANDOM_STATE)
# Train cloned base models then create out-of-fold predictions
# that are needed to train the cloned meta-model
out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
for i, model in enumerate(self.base_models):
for train_index, holdout_index in kfold.split(X, y):
instance = clone(model)
self.base_models_[i].append(instance)
instance.fit(X[train_index], y[train_index])
y_pred = instance.predict(X[holdout_index])
out_of_fold_predictions[holdout_index, i] = y_pred
# Now train the cloned meta-model using the out-of-fold predictions as new feature
self.meta_model_.fit(out_of_fold_predictions, y)
return self
#Do the predictions of all base models on the test data and use the averaged predictions as
#meta-features for the final prediction which is done by the meta-model
def predict(self, X):
meta_features = np.log1p(np.column_stack([
np.expm1(np.column_stack([model.predict(X) for model in base_models])).mean(axis=1)
for base_models in self.base_models_ ]))
return self.meta_model_.predict(meta_features)
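The out-of-fold bookkeeping in `fit` can be sketched in isolation: every row ends up with exactly one prediction, produced by a model that never saw that row (toy data, one base model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X_toy = rng.rand(100, 2)
y_toy = X_toy @ np.array([2.0, -1.0]) + 0.05 * rng.randn(100)

oof = np.full(100, np.nan)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, holdout_idx in kfold.split(X_toy):
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    oof[holdout_idx] = model.predict(X_toy[holdout_idx])

# Each row was predicted exactly once, by an out-of-fold model
print(int(np.isnan(oof).sum()))  # 0
```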
# Stacked generalization
print("---- Fitting Stacked Average modeling to the training set ----\n")
t0 = time()
stacked_averaged_models = StackingAveragedModels(base_models = (regr_lasso, regr_ridge, bay, GBoost, LGBoost, kernel),
meta_model = regr_elastic)
stacked_averaged_models.fit(X, y)
# Cross-validated predictions and RMSE
y_pred_stacked = KFold_predict(stacked_averaged_models, X, y)
print("RMSE: %.4f" % RMSE_error(stacked_averaged_models, X, y))
# Explained variance score: 1 is perfect prediction
print('R2 score: %.2f' % r2_score(y, y_pred_stacked))
print("\nTime spent: %0.3fs \n" % (time() - t0))
percentage_stacked = 0.50
percentage_xgb = 0.25
percentage_lgb = 0.25
def ensembled(X_input):
return stacked_averaged_models.predict(X_input) * percentage_stacked + model_xgb.predict(X_input) * percentage_xgb + model_lgb.predict(X_input) * percentage_lgb
# Make cross validated predictions
y_pred = y_pred_stacked * percentage_stacked + y_pred_lgb * percentage_lgb + y_pred_xgb * percentage_xgb
print("Time spent: %0.3fs \n" % (time() - t0))
print("- RMSE: %.4f \n" % np.sqrt(mean_squared_error(y_pred, y)))
# Explained variance score: 1 is perfect prediction
print('- R2 score: %.2f' % r2_score(y, y_pred))
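The final blend is a convex combination (the weights sum to 1), so the ensemble stays on the same log-price scale as its members; a sketch with hypothetical per-model predictions:

```python
import numpy as np

weights = np.array([0.50, 0.25, 0.25])  # stacked, xgb, lgb
preds = np.array([12.0, 12.2, 11.8])    # hypothetical log-price predictions

blend = float(np.dot(weights, preds))
print(round(blend, 2))  # 12.0
```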
def plot_residuals_and_predictions(y, y_pred):
# Plot residuals
plt.figure(figsize=(15,7))
sns.regplot(np.expm1(y_pred), np.expm1(y_pred) - np.expm1(y), color='red', fit_reg = True)
plt.title("Residual plot")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
# Plot predictions
plt.figure(figsize=(15,7))
sns.regplot(np.expm1(y_pred), np.expm1(y), color='green', fit_reg = True)
plt.title("Predicted vs Observed")
plt.xlabel("Predicted values")
plt.ylabel("Observed values")
plt.show()
plot_residuals_and_predictions(y, y_pred)
# Construct a dataframe from the test data csv file
test_data = pd.read_csv(path_data + "test.csv")
test_data['SalePrice'] = test_data['Id'] * 0 # Add a sale price column to the test data
# Construct a dataframe for the submission
submission = pd.DataFrame()
submission['Id'] = test_data['Id'] # Save the ID for submission
# Construct a clean version of the test data ready to be used in our machine learning models
X_test, y_test = prepare_data(test_data)
predictions = ensembled(X_test)
final_predictions = np.expm1(predictions)
submission['SalePrice'] = final_predictions
submission.head()
# (index=False) so Pandas won't create a new index.
submission.to_csv('submission.csv', index=False)